Defining Categorical Types
At the end of this topic you should be able to work with factor data types.
In this brief presentation, we'll introduce the following items:
Factors represent unique, discrete groupings that can be applied to a study design.
The character type
The function sample() allows us to take a random sample of elements from a vector of potential values.
If we want a large number of items, we can sample with or without replacement.
We’ll pretend we have a bunch of data related to the day of the week.
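A sketch of how such data might be generated (the variable name weekDays is an assumption; the exact values drawn are random):

```r
# Draw 40 day names, with replacement, from the seven days of the week
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
weekDays <- sample(days, size = 40, replace = TRUE)
summary(weekDays)  # a length-40 character vector
```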
   Length     Class      Mode 
       40 character character 
[1] "Sunday" "Monday" "Sunday" "Monday" "Wednesday" "Thursday"
[7] "Tuesday" "Saturday" "Tuesday" "Tuesday" "Wednesday" "Tuesday"
[13] "Sunday" "Saturday" "Thursday" "Friday" "Thursday" "Wednesday"
[19] "Friday" "Monday" "Tuesday" "Sunday" "Thursday" "Tuesday"
[25] "Tuesday" "Monday" "Monday" "Saturday" "Wednesday" "Wednesday"
[31] "Thursday" "Monday" "Friday" "Saturday" "Tuesday" "Saturday"
[37] "Saturday" "Monday" "Sunday" "Saturday"
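The factor printout that follows could be produced simply by converting such a character vector; a minimal sketch:

```r
# Convert a character vector into a factor; the levels default to
# the sorted (alphabetical) unique values
raw <- c("Sunday", "Monday", "Sunday", "Monday", "Wednesday", "Thursday")
f <- factor(raw)
class(f)   # "factor"
levels(f)  # "Monday" "Sunday" "Thursday" "Wednesday"
```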
factor [1] Sunday Monday Sunday Monday Wednesday Thursday Tuesday
[8] Saturday Tuesday Tuesday Wednesday Tuesday Sunday Saturday
[15] Thursday Friday Thursday Wednesday Friday Monday Tuesday
[22] Sunday Thursday Tuesday Tuesday Monday Monday Saturday
[29] Wednesday Wednesday Thursday Monday Friday Saturday Tuesday
[36] Saturday Saturday Monday Sunday Saturday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Notice the Levels attribute at the bottom of the output.
Each factor variable is defined by the levels that constitute the data. This is a finite set of unique values.
If a factor is not ordinal, it does not allow the use of relational comparison operators.
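A minimal sketch of what happens if you try:

```r
f <- factor(c("low", "high"))  # an unordered factor
f[1] < f[2]
# returns NA, with a warning that '<' is not meaningful for factors
```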
Where ordination matters:
Fertilizer treatments in kg of N₂ per hectare: 10 kg N₂, 20 kg N₂, 30 kg N₂,
Days of the Week: Friday is not followed by Monday,
Life History Stage: seed, seedling, juvenile, adult, etc.
Where ordination is irrelevant:
River
State or Region
Sample Location
[1] Sunday Monday Sunday Monday Wednesday Thursday Tuesday
[8] Saturday Tuesday Tuesday Wednesday Tuesday Sunday Saturday
[15] Thursday Friday Thursday Wednesday Friday Monday Tuesday
[22] Sunday Thursday Tuesday Tuesday Monday Monday Saturday
[29] Wednesday Wednesday Thursday Monday Friday Saturday Tuesday
[36] Saturday Saturday Monday Sunday Saturday
7 Levels: Friday < Monday < Saturday < Sunday < Thursday < ... < Wednesday
The problem is that the default ordering is actually alphabetical!
Specifying the Order of Ordinal Factors
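Passing the levels explicitly, in the order we want, and setting ordered = TRUE might look like this (the variable names are illustrative):

```r
dayLevels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekDays <- factor(c("Sunday", "Monday", "Wednesday"),
                   levels = dayLevels, ordered = TRUE)
weekDays[1] > weekDays[2]  # TRUE: Sunday comes after Monday in this ordering
```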
[1] Sunday Monday Sunday Monday Wednesday Thursday Tuesday
[8] Saturday Tuesday Tuesday Wednesday Tuesday Sunday Saturday
[15] Thursday Friday Thursday Wednesday Friday Monday Tuesday
[22] Sunday Thursday Tuesday Tuesday Monday Monday Saturday
[29] Wednesday Wednesday Thursday Monday Friday Saturday Tuesday
[36] Saturday Saturday Monday Sunday Saturday
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
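Once ordered, functions such as sort() arrange the data by level rather than alphabetically; the sorted output that follows could come from a call like:

```r
dayLevels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
weekDays <- factor(c("Sunday", "Monday", "Friday", "Monday"),
                   levels = dayLevels, ordered = TRUE)
sort(weekDays)  # Monday Monday Friday Sunday
```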
[1] Monday Monday Monday Monday Monday Monday Monday
[8] Tuesday Tuesday Tuesday Tuesday Tuesday Tuesday Tuesday
[15] Tuesday Wednesday Wednesday Wednesday Wednesday Wednesday Thursday
[22] Thursday Thursday Thursday Thursday Friday Friday Friday
[29] Saturday Saturday Saturday Saturday Saturday Saturday Saturday
[36] Sunday Sunday Sunday Sunday Sunday
7 Levels: Monday < Tuesday < Wednesday < Thursday < Friday < ... < Sunday
You cannot assign a value to a factor that is not one of the pre-defined levels.
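A quick sketch of what that looks like:

```r
f <- factor(c("Monday", "Tuesday"), levels = c("Monday", "Tuesday"))
f[1] <- "Funday"
# warning: invalid factor level, NA generated
f  # first element is now <NA>
```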
The forcats library
Part of the tidyverse group of packages.
This library has many helper functions that make working with factors a bit easier. I'm going to give you a few examples here but strongly encourage you to look at the cheat sheet for all the other options.
A summary of levels.
Coalescing low-frequency levels.
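Two of those helpers, as a sketch (assuming forcats is installed): fct_count() summarizes the levels, and fct_lump_min() coalesces rare levels into an "Other" category.

```r
library(forcats)

f <- factor(c("A", "A", "A", "B", "B", "C"))
fct_count(f)              # counts per level: A = 3, B = 2, C = 1
fct_lump_min(f, min = 2)  # levels seen fewer than 2 times become "Other"
```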
We can reorder the levels by one of several criteria, including: appearance order, number of observations, or numeric order.
By Appearance
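With fct_inorder(), the levels are set to the order in which the values first appear in the data; a small sketch:

```r
library(forcats)

x <- c("Sunday", "Monday", "Sunday", "Wednesday", "Thursday", "Tuesday")
levels(fct_inorder(x))
# "Sunday" "Monday" "Wednesday" "Thursday" "Tuesday"
```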
[1] "Sunday" "Monday" "Wednesday" "Thursday" "Tuesday" "Saturday"
[7] "Friday"
This pulls out specific levels and puts them in the lead positions, in the order that you list them after the vector of data.
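For example, fct_relevel() promotes the named levels to the front, leaving the rest in their original order:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))
levels(fct_relevel(f, "c", "b"))
# "c" "b" "a" "d"
```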
There are times when some subset of your data does not include an example of every level; here is how you drop the unused levels.
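Either base R's droplevels() or forcats' fct_drop() will do this; a base-R sketch:

```r
f <- factor(c("a", "b"), levels = c("a", "b", "c"))
levels(droplevels(f))  # "a" "b" -- the unused level "c" is gone
```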
The iris data
This data set is associated with R. A. Fisher: British polymath, mathematician, statistician, geneticist, and academic. He founded things such as the F test and the exact test.
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Question: What is the mean and variance in sepal length for each of the Iris species?
The by() function allows us to perform some function on data based upon a grouping index.
by(): the ‘Classic Approach’
Here we can apply the function mean() to the data on sepal length, using the species factor as a category.
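A minimal example:

```r
data(iris)
# Apply mean() to sepal length within each species
by(iris$Sepal.Length, iris$Species, mean)
# setosa: 5.006, versicolor: 5.936, virginica: 6.588
```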
The same approach works for estimating the variance.
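Only the applied function changes:

```r
data(iris)
# Apply var() to sepal length within each species
by(iris$Sepal.Length, iris$Species, var)
```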
group_by(): the Tidy Way
For a less “money sign” ($ indexing) approach…
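A sketch using dplyr (part of the tidyverse):

```r
library(dplyr)

# Group the rows by species, then summarize each group
iris |>
  group_by(Species) |>
  summarize(mean_sepal = mean(Sepal.Length),
            var_sepal  = var(Sepal.Length))
```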
We have already seen the binomial distribution, which is applicable for data that has two categories.
\[ P(K|N,p) = \frac{N!}{K!(N-K)!}p^K(1-p)^{N-K} \]
\(H_O: p = \hat{p}\)
Consider the following sampling of catfish from our hypothetical examples.
[1] Other Other Catfish Other Other Catfish Other Catfish Catfish
[10] Catfish Other Other Catfish Catfish Catfish Catfish Catfish Catfish
[19] Other Other Catfish Catfish Catfish Catfish Other Catfish Catfish
[28] Catfish Other Catfish Catfish Catfish Other Catfish Other Catfish
[37] Catfish Other Catfish Catfish Catfish Other Catfish Catfish Catfish
[46] Other Catfish Catfish Catfish Other
Levels: Catfish Other
We could test the hypothesis that the frequency of catfish in this sampling effort is not significantly different from our previously observed frequency of 74%.
Formally, we would specify:
\(H_O: p = 0.74\)
and test it against the alternative hypothesis
\(H_A: p \ne 0.74\)
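With 33 catfish among the 50 fish sampled, output of the form shown below would come from a call like:

```r
# Exact binomial test of 33 successes in 50 trials against p = 0.74
binom.test(x = 33, n = 50, p = 0.74)
```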
Exact binomial test
data: 33 and 50
number of successes = 33, number of trials = 50, p-value = 0.1991
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
0.5123475 0.7879453
sample estimates:
probability of success
0.66
From this, we can see the following components:
the hypothesized TRUE frequency (\(0.74\)), the number of successes and trials, the p-value, a 95% confidence interval, and the estimated probability of success (\(0.66\)).
For completeness, this approach can be extended to data with more than two categories using the multinomial distribution.
\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_K = x_K|N) = \frac{N!}{\prod_{i=1}^{K}x_i!}\prod_{i=1}^K p_i^{x_i} \]
We will skip this for simplicity as there are other ways we can evaluate it.
A contingency table is a method to determine if the count of the data we see in 2 or more categories conform to expectations based upon hypothesized frequencies. Contingency tables use expectations based upon the \(\chi^2\) distribution to evaluate statistical confidence.
\[ P(x) = \frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}e^{-x/2} \]
Thinking a bit differently: when we originally specified the expectation for the number of catfish, we only estimated \(p_{catfish}\) and said \(q = 1-p =\; not \; catfish\). For contingency tables, we use these as the vector of probabilities and call them our expected frequencies.
\[ p = \left( \frac{37}{50}, \frac{13}{50} \right) = (0.74,0.26) \]
We can use this as a general framework for 2 or more categories of observations.
\[ E = [E_1, E_2] \]
If we had \(K\) different species of fishes, it could similarly be expanded to something like:
\[ E = [E_1, E_2, \ldots, E_K] \]
This can be used to determine the expected number of observations for any arbitrary sample size. For example, if you sampled 172 fishes, you would expect counts in these proportions.
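Concretely, with \(p = (0.74, 0.26)\) from the catfish example and \(N = 172\):

```r
# Expected counts are simply N times the expected frequencies
p <- c(Catfish = 0.74, Other = 0.26)
E <- 172 * p
E  # Catfish 127.28, Other 44.72
```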
Which can be generalized as the vector of Observed data,
\[ O = [O_1, O_2] \]
The test statistic here, \(T\), is defined as the standardized distance between observed and expected observations.
\[ T = \sum_{i=1}^c \frac{(O_i - E_i)^2}{E_i} \]
Which is distributed (with some assumptions) as a \(\chi^2\) random variable.
\(H_O: O = E\)
In R, we can test this using the chisq.test() function.
chisq.test(x, y = NULL, correct = TRUE,
p = rep(1/length(x), length(x)), rescale.p = FALSE,
simulate.p.value = FALSE, B = 2000)
The salient points are: the vector of observed counts x, the vector of hypothesized probabilities p (uniform by default; it must sum to 1 unless rescale.p = TRUE), and the option to simulate the p-value when expected counts are small.
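Applied to the catfish sample above (33 catfish, 17 other) with the hypothesized frequencies:

```r
observed <- c(Catfish = 33, Other = 17)
# Expected counts under H_O are 50 * c(0.74, 0.26) = (37, 13)
chisq.test(observed, p = c(0.74, 0.26))
# T = (33-37)^2/37 + (17-13)^2/13, about 1.66 on 1 df
```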
We can expand this approach to two or more categorical data types. Consider the following data set in R, for hair and eye colors collected in 1974 at the University of Delaware (n.b., this is a 3-dimensional matrix of observations):
For simplicity, I’m going to combine the 3rd dimension to produce a single 4x4 matrix of observations.
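One way to collapse the built-in HairEyeColor array (Hair x Eye x Sex) across the Sex dimension is margin.table(); the table that follows is the result:

```r
# Sum over the 3rd margin (Sex), keeping Hair (1) and Eye (2)
hairEye <- margin.table(HairEyeColor, margin = c(1, 2))
hairEye
```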
Eye
Hair Brown Blue Hazel Green
Black 68 20 15 5
Brown 119 84 54 29
Red 26 17 14 14
Blond 7 94 10 16
Assumptions:
1. Each row is a random sample relative to the column.
2. Each column is a random sample relative to the row.
That is, each row is a random sample whose proportions represent one sample from the larger set of data (the remaining rows), and likewise each column relative to the remaining columns.
The raw data consists of a matrix of values.
\[ \begin{bmatrix} O_{11} & O_{12} & \ldots & O_{1c} \\ O_{21} & O_{22} & \ldots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ O_{r1} & O_{r2} & \ldots & O_{rc} \end{bmatrix} \]
The factor represented by the row variable can be summed across all columns.
\[ R = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{r} \end{bmatrix} \]
And the data represented by the columns can be summed as:
\[ C = \begin{bmatrix} C_{1} & C_{2} & \ldots & C_{c} \end{bmatrix} \]
And the total number of observations, \(N\), is given by
\[ N = \sum_{i=1}^r R_i \]
or
\[ N = \sum_{j=1}^c{C_j} \]
For contingency tables, we are assuming, under the null hypothesis, that the variables represented in counts of rows and columns are independent of each other.
\(H_O:\) The event ‘an observation in row i’ is independent of the event ‘the same observation in column j’ for all i and j.
Or if we want to shorten it
\(H_O:\) Hair and eye colors are independent traits.
The expected values for each row (i) and column (j) are estimated as:
\[ E_{i,j} = \frac{R_i C_j}{N} \]
Similar to the one-row example above, the test statistic for larger tables are the standardized differences between observed and expected values.
\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Which is evaluated against the \(\chi^2\) distribution to ascertain statistical confidence.
In R, it is slightly simplified as:
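Passing the matrix of counts directly to chisq.test() runs the test of independence; a sketch using the collapsed hair/eye table:

```r
hairEye <- margin.table(HairEyeColor, margin = c(1, 2))
fit <- chisq.test(hairEye)
fit         # test statistic on (4-1)*(4-1) = 9 degrees of freedom
names(fit)  # the list components: statistic, parameter, p.value, expected, ...
```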
As usual, this returns a list-like response with all the data we need.
It is a bit less verbose than other examples but has the needed information.